Abstract

Low-density parity-check codes are attractive for high throughput applications because of their low 
decoding complexity per bit, but also because all the codeword bits can be decoded in parallel. However, 
achieving this in a circuit implementation is complicated by the number of wires required to exchange 
messages between processing nodes. Decoding algorithms that exchange binary messages are interesting 
for fully-parallel implementations because they can reduce the number and the length of the wires, and 
increase logic density. This paper introduces the Relaxed Half-Stochastic (RHS) decoding algorithm, a 
binary message belief propagation (BP) algorithm that achieves a coding gain comparable to the best 
known BP algorithms that use real-valued messages. We derive the RHS algorithm by starting from 
the well-known Sum-Product algorithm, and then derive a low-complexity version suitable for circuit 
implementation. We present extensive simulation results on two standardized codes having different 
rates and constructions, including low bit error rate results. These simulations show that RHS can be 
an advantageous replacement for the existing state-of-the-art decoding algorithms when targeting fully- 
parallel implementations.

I. Introduction

Low-density parity-check (LDPC) codes can approach channel capacity with a low decoding 
complexity per bit, making them attractive for a wide range of error correction applications. For 
most applications, the decoding operation must be performed by a custom circuit implementation 
because of the processing performance requirements. For a given coding gain, we seek to optimize 
the processing performance (throughput, latency) normalized to the circuit area. 

The decoding of LDPC codes is part of a large family of problems that can be solved by 
iteratively passing messages in a factor graph [1]. When the desired decoding throughput is high,
the most efficient implementation approach is to explicitly map the factor graph in hardware. This 
is known as a fully-parallel implementation. Because of the structure of LDPC factor graphs, 
the wiring complexity typically represents a large portion of the implementation complexity. It 
has a big impact on area requirements, as well as dynamic power consumption. Furthermore, the 
longer wires are likely to form critical paths and constrain the maximum clock frequency. We 
will call a circuit graph a graph where a node corresponds to a localized circuit block, and an 
edge corresponds to a connection between two circuit blocks, composed of one or several wires. 
In the simplest fully-parallel implementation, the circuit graph is identical to the factor graph. 

Different approaches have been proposed to reduce the wiring complexity of fully-parallel
LDPC decoders. Circuit architectures have been proposed [2], [3] that send messages serially
on a single wire, at the expense of using a larger number of clock cycles to exchange messages.
However, this does not change the topology of the circuit graph, which can still lead to
routing congestion. In Split-Row decoders [4], the topology of the circuit graph is modified by
partitioning check nodes into multiple sub-nodes that are then linked by two or four wires. The
authors show that this topology change can have a big impact on the area and power requirements
of the decoder. However, their approach suffers from increasing error rates as the number of
partitions increases. Finally, stochastic decoding algorithms have been introduced as a
low-complexity alternative to the standard algorithms. The technique was initially demonstrated
only on small codes [5], but since then stochastic decoding has been applied to longer LDPC
codes used in communication standards [6]. Since they rely on binary messages, stochastic
algorithms have a low wiring complexity. They also use a very simple check node function, which
can be partitioned arbitrarily without introducing any approximation. However, these stochastic
algorithms suffer from a high latency and a significant loss in coding gain with respect to
Sum-Product Algorithm (SPA) decoding.

In this paper, we introduce a new BP algorithm that is a binary message passing (BMP) algo- 
rithm, which for an LDPC decoder means that the algorithm is constrained to using only modulo-2 
addition as the check node function. Hard-decision decoding algorithms such as "Gallager-B" and 
the stochastic algorithms mentioned previously are BMP algorithms. The Relaxed Half-Stochastic 
(RHS) algorithm relies on stochastic messages for exchanging information between processing 
nodes, but differs from existing stochastic algorithms in the way messages are generated, and 
in the operations performed in the variable nodes. The name "relaxed" comes from the use in 
the variable node function of an estimation mechanism that is similar to the relaxation step in 
a Successive Relaxation decoder [7]. Our results show that despite the constraint on the check 
node function, it is possible to achieve very good performance, in terms of error rate, latency, 
and throughput, while preserving a low implementation complexity. The RHS algorithm can 
match or even outperform the error rate of BP algorithms that use real-valued messages, such as 
the Sum-Product algorithm. Furthermore, an implementation only requires addition operations 
for the variable node function, XOR operations for the check node function, and two wires per 
(bi-directional) edge of the circuit graph. 

A preliminary version of the RHS algorithm was introduced in [8]. In this paper, we present
several improvements that increase the performance while reducing complexity. In Section II,
we briefly review the Sum-Product, Min-Sum and Normalized Min-Sum algorithms, and review
the stochastic message representation. In Section III, the algorithm is developed using the
Sum-Product Algorithm as a basis. Following this, we show how the data throughput can be increased
by changing the decoding rule at some pre-determined iterations, similar to Gear-Shift decoding
[9]. We also present a method for lowering error floors that can be used with RHS or SPA
decoders. In Section IV, we derive a low-complexity implementation from the ideal case described
previously. Finally, in Section V, we present simulation results for two standardized codes.
Results are presented for various versions of the algorithm, from the ideal case described in
Section III to a version containing all the approximations required to achieve a low-complexity
circuit implementation. The performance is described in terms of bit error rate, as well as average
and maximum number of iterations.

II. Background

An LDPC code can be modeled by a bipartite graph $\mathcal{G} = (V, C, E)$, where elements of $C$, named
check nodes (CN), represent the parity-check equations, and elements of $V$, named variable
nodes (VN), represent the variables of the parity-check equations. A variable can represent a
transmitted symbol, or, in the case of punctured codes, an additional unknown that is not part
of the transmitted information. An edge exists between a variable node and a check node if the
variable is an argument of the parity-check equation associated with the check node. Throughout
the paper, we denote the degree of a given variable node by $d_v$, and the degree of a given check
node by $d_c$.

We will introduce Algorithm 1 as a template to be shared by all BP algorithms. This template
will allow us to define all the algorithms in terms of a VN function VAR and of a CN function
CHK, with subscripts indicating the algorithm. Let $v_i \in V$ denote the variable node with index
$i \in [1, n]$, $c_j \in C$ denote the check node with index $j$, and $\mathcal{N}(x)$ denote the set of
neighbors of a node $x$. In Alg. 1 we use $V_i = \{j : c_j \in \mathcal{N}(v_i)\}$ and
$C_j = \{i : v_i \in \mathcal{N}(c_j)\}$. We also denote a message from $v_i$ to $c_j$ by $\eta_{i,j}$
and from $c_j$ to $v_i$ by $\theta_{j,i}$. The INIT($y_i$) function
converts a channel output into the representation domain used by the specific algorithm, and
similarly the neutral value is the one corresponding to a probability of 1/2 in that domain.
Finally, for a set $S = \{s_1, s_2, \ldots, s_z\}$,
the expression VAR($S$) is equivalent to VAR($s_1, s_2, \ldots, s_z$), and similarly for CHK.

In the paper we use the term throughput to refer to the average number of bits processed by 
the decoder per time unit. The BP algorithm terminates as soon as a codeword is found, and 
therefore in the discussions we will assume that the data throughput of the decoder is inversely 
proportional to the average number of iterations until convergence. We also use the term latency 
to refer to the time required to run the decoder for L iterations. 

A. The Sum-Product Algorithm 

The Sum-Product algorithm is a BP algorithm that, for a cycle-free $\mathcal{G}$, computes the
maximum-likelihood (ML) estimate of each codeword bit. Because of the cycles contained in an LDPC
code's factor graph, the Sum-Product algorithm is not guaranteed to converge to the bit-wise
ML estimates. Nonetheless, it has proven to be a very useful approximation algorithm. The
Sum-Product algorithm can be expressed in terms of various metrics, notably probability values
or log likelihood ratios (LLR). The LLR metric is most often used
for implementations because it reduces the quantization error, and lowers the implementation
complexity by avoiding the need for multiplications. An LLR metric $\Lambda$ is defined in terms of a
probability $p$ as $\Lambda = \ln((1-p)/p)$. Following [1], we will use the notation $\{\sim x_i\}$ to denote the
set $\{x_1, x_2, \ldots, x_n\} \setminus \{x_i\}$, where $n$ is given implicitly from the context. In the LLR domain,
a variable node output $\Lambda'_i$, $1 \le i \le d_v$, is given by

$$\Lambda'_i = \mathrm{VAR}_\Lambda(\sim\Lambda_i) = \Lambda_0 + \sum_{\Lambda_j \in \{\sim\Lambda_i\}} \Lambda_j = \Lambda_0 + \sum_{j=1}^{d_v} \Lambda_j - \Lambda_i, \tag{1}$$

where $\Lambda_i$ is the input message on edge $i$ and $\Lambda_0$ is the a-priori likelihood obtained from the
channel. A common expression for computing a check node output $\Lambda'_i$, $1 \le i \le d_c$, is given by

$$\Lambda'_i = \mathrm{CHK}_\Lambda(\sim\Lambda_i) = \operatorname{arctanh}\left( \prod_{\Lambda_j \in \{\sim\Lambda_i\}} \tanh(\Lambda_j) \right). \tag{2}$$

However, the use of tanh and arctanh is a source of quantization error in implementations. The
check node function in the Min-Sum algorithm [10] removes this source of quantization error and reduces
the complexity at the cost of an approximation. This check node function is given by

$$\Lambda'_i = \mathrm{CHK}_{MS}(\sim\Lambda_i) = \min_{\Lambda_j \in \{\sim\Lambda_i\}} |\Lambda_j| \prod_{\Lambda_j \in \{\sim\Lambda_i\}} \operatorname{sign}(\Lambda_j), \tag{3}$$

where $\operatorname{sign}(x) = 1$ if $x > 0$, and $\operatorname{sign}(x) = -1$ if $x < 0$. The approximation can be improved
by performing a multiplicative or additive correction on the $\min(\cdot)$ operation [11], [12]. With
a multiplicative correction, the algorithm is known as Normalized Min-Sum (NMS). The NMS
check node function is given by

$$\Lambda'_i = \alpha \, \mathrm{CHK}_{MS}(\sim\Lambda_i), \tag{4}$$

where $0 < \alpha < 1$ depends on the code structure. Although the optimal $\alpha$ should also depend on
the channel signal-to-noise ratio (SNR), this can be ignored in practice [11].
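
As a rough illustration of (1), (3) and (4), the following Python sketch computes the extrinsic outputs of a single variable node and of a single Min-Sum check node. The function names are hypothetical, and treating sign(0) as +1 is an assumption made here only to keep the code total; neither is taken from the paper.

```python
def var_llr(prior, inputs):
    """Eq. (1): extrinsic VN outputs, i.e. prior plus all inputs except the i-th."""
    total = prior + sum(inputs)
    return [total - x for x in inputs]


def chk_min_sum(inputs, alpha=1.0):
    """Eqs. (3)-(4): extrinsic CN outputs; alpha=1.0 gives plain Min-Sum."""
    outputs = []
    for i in range(len(inputs)):
        others = inputs[:i] + inputs[i + 1:]
        sign = 1.0
        for x in others:
            if x < 0:          # sign(0) is treated as +1 (assumption)
                sign = -sign
        outputs.append(alpha * sign * min(abs(x) for x in others))
    return outputs
```

Calling `chk_min_sum(msgs, alpha=0.5)` applies the multiplicative NMS correction; the value 0.5 is only a placeholder, since a suitable $\alpha$ depends on the code.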

B. Stochastic Belief Representation 

The stochastic stream representation expresses belief information by using a random sequence
of binary messages, where the information is contained in the sequence's mean function. In the
probability domain, the Sum-Product check node function for $n$ inputs is given by [13]

$$\mathrm{CHK}_p(p_1, p_2, \ldots, p_n) = \frac{1 - \prod_{i=1}^{n} (1 - 2 p_i)}{2}. \tag{5}$$

For the stochastic check node function, we let the check node inputs be the random bit sequences
$\{X_1(t), \ldots, X_n(t)\}$, independent and distributed such that $E[X_i(t)] = p_i(t)$. To evaluate the check
node function on these stochastic stream inputs, we want the check node function binary output
$Y(t)$ to be a random sequence with $E[Y(t)] = \mathrm{CHK}_p(p_1(t), \ldots, p_n(t))$. This is satisfied by

$$Y(t) = \mathrm{CHK}_{bin}(X_1(t), X_2(t), \ldots, X_n(t)) = X_1(t) \oplus X_2(t) \oplus \cdots \oplus X_n(t), \tag{6}$$

where $\oplus$ represents modulo-2 addition. Note that this is also the function used in Gallager's
hard-decision decoding algorithms [13]. The probability domain SPA variable node function for
2 inputs is

$$\mathrm{VAR}_p(p_1, p_2) = \frac{p_1 p_2}{(1 - p_1)(1 - p_2) + p_1 p_2}. \tag{7}$$

The function can be obtained for more inputs by re-using the two-input function, e.g.
$\mathrm{VAR}_p(p_1, p_2, p_3) = \mathrm{VAR}_p(p_1, \mathrm{VAR}_p(p_2, p_3))$. If we now let the stochastic variable node function inputs be
$\{X_1(t), \ldots, X_n(t)\}$, we would like the binary output $Y(t)$ to be distributed such that $E[Y(t)] =
\mathrm{VAR}_p(p_1(t), \ldots, p_n(t))$. However, there is no memoryless binary-valued function that will achieve
this. Some functions with memory are proposed in [5], [6]. In this paper, we introduce a new
variable node function that is more accurate, and that can handle an extension of the concept of
stochastic message to more than one bit.
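
The relationship between (5) and (6) can be checked numerically. The sketch below is only an illustration under the independence assumption stated above: it evaluates (5) and (7) directly, and estimates the mean of the XOR of independent Bernoulli streams by simulation. All function names are hypothetical.

```python
import random


def chk_p(probs):
    """Eq. (5): probability that the XOR of independent bits equals 1."""
    prod = 1.0
    for p in probs:
        prod *= 1.0 - 2.0 * p
    return (1.0 - prod) / 2.0


def var_p(probs):
    """Eq. (7), extended to several inputs by re-using the two-input function."""
    acc = probs[0]
    for p in probs[1:]:
        acc = acc * p / ((1.0 - acc) * (1.0 - p) + acc * p)
    return acc


def xor_stream_mean(probs, length, rng):
    """Empirical mean of Eq. (6) applied to independent Bernoulli streams."""
    ones = 0
    for _ in range(length):
        y = 0
        for p in probs:
            y ^= 1 if rng.random() < p else 0
        ones += y
    return ones / length
```

With a long enough stream, `xor_stream_mean(probs, length, random.Random(seed))` should approach `chk_p(probs)`, which is the property that motivates using (6) as the stochastic check node function.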



III. The RHS Algorithm 

The objective of the RHS algorithm is to achieve high-precision decoding while only relying 
on binary messages. The advantages of using binary messages include the smaller number of 
wires required for transmitting messages, and the low complexity and other interesting properties 
of the binary check node function (Eq. (6)), which will be discussed in Section III-C. Note that
these advantages are not related to the number of bits that are sent on a given edge of the factor 
graph during one iteration, as long as the bits are sent sequentially on the same wire. We can 
therefore introduce an extension of stochastic messages to sequences of binary messages. We 
will refer to these extended stochastic messages as "iteration messages". In an RHS decoder, 
all information is exchanged between processing nodes in the form of stochastic messages, but 
the variable node computation can be performed in another representation domain. This allows
computing the variable node function with any desired accuracy.

A. Check Node Function

As is the case for SPA, the RHS check node function takes as input $d_c - 1$ messages, and
produces one output message. Ideally, for output $i$ this computation would be evaluating
$m_i = \mathrm{CHK}_p(\sim p_i)$. However, the messages are constrained to be binary, and the CN instead evaluates (6)
one or several times. Parameter $k > 0$ controls the number of times (6) is used in one iteration.
We denote the binary inputs to the CN as $X_{i,j}$, where $i \in [1, d_c]$ is the input index, and $j \in [1, k]$
the bit index. The $j$-th binary check node evaluation for output $i$ is given by

$$Y_{i,j} = \mathrm{CHK}_{bin}(X_{1,j}, \ldots, X_{i-1,j}, X_{i+1,j}, \ldots, X_{d_c,j}). \tag{8}$$

We will then define a function $g_k$ that estimates the ideal iteration message $m_i$ from the binary
messages:

$$\hat{m}_i = g_k(Y_{i,1}, Y_{i,2}, \ldots, Y_{i,k}). \tag{9}$$

In practice, the variable node circuit receives the sequence $\{Y_{i,1}, Y_{i,2}, \ldots, Y_{i,k}\}$ and evaluates (9),
but conceptually, it belongs to the check node computation. Note also that this message estimate
corresponds to a single iteration of the algorithm, and should not be confused with the tracking
estimator that will be introduced shortly.

We will denote by $\mathcal{M}$ the image of $g_k$, or the message set. If the check node input messages
$\{X_{1,j}, \ldots, X_{d_c,j}\}$ are independent and distributed such that $E[X_{i,j}] = p_i$ (independent of $j$),
the outputs $\{Y_{i,1}, Y_{i,2}, \ldots, Y_{i,k}\}$ are independent with $E[Y_{i,j}] = m_i$, and the optimal message
estimator $g_k$ is the sample mean:

$$\hat{m}_i = g_k(Y_{i,1}, \ldots, Y_{i,k}) = \frac{1}{k} \sum_{j=1}^{k} Y_{i,j}. \tag{10}$$

Therefore the check node function is obtained by combining (10) and (8). Under (10), $\mathcal{M}$ has
$k + 1$ elements. The $n$-th element of $\mathcal{M}$ will be denoted $\mu_n$, $0 \le n \le k$, and defined by $\mu_n = \hat{m}$
such that $\sum_{j=1}^{k} Y_{i,j} = n$.
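
A minimal sketch of this check node computation is given below, assuming the sample-mean estimator described above. The data layout `x[n][j]` (bit `j` received on CN input `n`) and the function names are illustrative choices, not part of the paper.

```python
def chk_bin(bits):
    """Eq. (6): modulo-2 sum of the given bits."""
    y = 0
    for b in bits:
        y ^= b
    return y


def cn_output_estimate(x, i):
    """Eq. (8) for each of the k bit positions, then the sample mean over them.

    x[n][j] is bit j received on CN input n; input i is excluded (extrinsic).
    """
    k = len(x[0])
    ys = [chk_bin([x[n][j] for n in range(len(x)) if n != i]) for j in range(k)]
    return ys, sum(ys) / k


def message_set(k):
    """The k + 1 values mu_0, ..., mu_k that the sample mean can take."""
    return [n / k for n in range(k + 1)]
```

For example, with three inputs and `k = 4`, the estimate for output 0 is always one of the five values returned by `message_set(4)`.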

B. Variable Node Function 

The variable node function for output $i$ takes as input $\{\sim\hat{m}_i\} \in \mathcal{M}^{d_v - 1}$ and the codeword
symbol prior $p_0$, and generates a binary output sequence $\{X_{i,1}, X_{i,2}, \ldots, X_{i,k}\}$. A functional
representation of the computation is shown for a single output in Fig. 1.

In the stochastic message-passing approach, the error correction performance of a decoder
is decoupled from the precision of the messages exchanged by interpreting these messages as
random sequences. In the RHS decoder, we are not interested in using the iteration messages $\hat{m}_i$
directly, but rather in extracting their underlying probability distributions $p_i(t)$. For the sake of
simplicity, we use a linear tracking function, given by

$$p_i(t) = (1 - \beta)\, p_i(t-1) + \beta\, \hat{m}_i(t), \tag{11}$$

where $\hat{m}_i(t)$ is the new received message, and $0 < \beta < 1$ is a real constant.

Using the estimates $p_i(t)$ as input, the standard SPA function is then used to generate an
intermediate output message $p'_i(t)$. The specific function used will depend on the representation
domain. In the probability domain, the output probability for edge $i$, $p'_i(t)$, is given by

$$p'_i(t) = \mathrm{VAR}_p(p_1(t), \ldots, p_{i-1}(t), p_{i+1}(t), \ldots, p_{d_v}(t), p_0). \tag{12}$$

However, we will see in Section IV that the LLR domain is more convenient for the implementation.
Finally, we generate the binary sequence that will be sent on edge $i$. This sequence is
composed of $k$ independent binary random variables $X_{i,j}$ such that $E[X_{i,j}] = p'_i(t)$, $1 \le j \le k$.

This can be implemented by generating $k$ random thresholds $T_j$, such that $T_j$ is uniformly
distributed over [0, 1], and constructing $X_{i,j}$ as

$$X_{i,j} = \begin{cases} 1 & \text{if } p'_i(t) > T_j, \\ 0 & \text{otherwise.} \end{cases} \tag{13}$$
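
The two ends of the variable node computation, the input tracker (11) and the output bit generation (13), can be sketched as follows. This is an illustration only: the function names are hypothetical, and Python's `random.Random` stands in for whatever random source an implementation would use.

```python
import random


def track(p_prev, m_hat, beta):
    """Eq. (11): linear tracking of the received iteration messages."""
    return (1.0 - beta) * p_prev + beta * m_hat


def emit_bits(p_out, k, rng):
    """Eq. (13): k bits, each set to 1 when p_out exceeds a fresh uniform threshold."""
    return [1 if p_out > rng.random() else 0 for _ in range(k)]
```

Each emitted bit has mean `p_out`, so averaging the bits over many iterations recovers the intermediate output probability of (12).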

C. Properties of the binary check node function 

In an implementation of the RHS algorithm, the binary check node function (Eq. (6)) is the
only operation performed on message bits as they are transmitted between variable node blocks.
Compared to the check node functions used in SPA or Min-Sum (Eqs. (2) to (4)), it has much lower
complexity since only a modulo-2 sum is required. Furthermore, the various extrinsic outputs of
a check node can be computed in terms of a total function. For a given CN output $Y_i$ we have

$$Y_i = \mathrm{CHK}_{bin}(\mathrm{CHK}_{bin}(X_1, X_2, \ldots, X_{d_c}), X_i). \tag{14}$$

This can be used to simplify the implementation by computing only the total
$Y_T = \mathrm{CHK}_{bin}(X_1, X_2, \ldots, X_{d_c})$ and broadcasting this result to all neighboring variable nodes, which then perform
$Y_i = \mathrm{CHK}_{bin}(Y_T, X_i)$. Finally, the most important property is that (14) can be factored arbitrarily.
In a circuit implementation, this allows partitioning the logic in several locations, to provide
flexibility in the circuit layout. In [4], the ability to partition check nodes was shown to lead to
large improvements in area, throughput and power. Contrary to [4], check node partitioning in
RHS requires no approximation and only uses two wires for linking partitions, since a message
is transmitted on a single wire. The circuit implementation aspects of check-node partitioning
are discussed further in Section V.
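
Both properties follow from the associativity of XOR and can be illustrated in a few lines. The sketch below uses hypothetical function names and an arbitrary equal-size partitioning; it is not meant to reflect any particular circuit layout.

```python
from functools import reduce
from operator import xor


def extrinsic_outputs(bits):
    """Eq. (14): compute the total once, then remove each input by XOR-ing it again."""
    total = reduce(xor, bits, 0)
    return [total ^ b for b in bits]


def partitioned_total(bits, num_parts):
    """Total XOR computed per partition, then combined across partitions."""
    size = -(-len(bits) // num_parts)  # ceiling division
    partials = [reduce(xor, bits[n:n + size], 0) for n in range(0, len(bits), size)]
    return reduce(xor, partials, 0)
```

Whatever the number of partitions, `partitioned_total` returns the same value as the unpartitioned XOR, which is why partitioning introduces no approximation.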

D. $\beta$-sequences

The choice of (11) as a tracking function makes RHS similar to a Successive Relaxation
SPA algorithm [7], but with the important difference that for RHS $\hat{m}_i$ and $p_i$ are defined on
different domains. Assuming that the maximum number of decoding iterations is fixed to $L$, we
are interested in two performance metrics of the decoder, namely the bit error rate (BER) and
the average number of iterations. If we constrain $\beta$ to be constant, we can look for the value
that optimizes some combination of these metrics. We will refer to this value as the optimal
length-one $\beta$-sequence, denoted $\beta^*$. Experimental evidence shows that it depends on $k$, $L$, and
on the channel SNR. This was also observed in [7] for $L$ and SNR. For a given $k$, $L$, and SNR,
a simple way to identify $\beta^*$ is through Monte-Carlo simulation. Our results show that $\beta^*$ is
only weakly affected by SNR, and that in practice this dependence can be ignored. Therefore
the simulation can be performed at a moderate SNR, thereby ensuring that the computational
complexity is reasonable. In our Monte-Carlo simulator, we included the ability to record BER
as a function of the decoding iteration index. The BER can then be plotted in terms of the
number of iterations, as in Fig. 2a (we will call these "settling curves"), or as a transfer function
in the style of an error-probability EXIT chart [14], as in Fig. 2b. By superimposing settling
curves corresponding to different $\beta$ values, one can identify $\beta^*$ as a function of $L$, albeit with the
constraint that $\beta^*$ is in the set of $\beta$ values simulated. For example, from Fig. 2a we can determine
that given $\beta^* \in \{0.5, 0.25, 0.15\}$, $\beta^* = 0.5$ for $L < 11$, and $\beta^* = 0.25$ for $11 < L < 50$. In this
example, we have used a cost function that assigns a small but non-zero weight to the average
number of iterations, such that the optimization is in terms of BER, but ties in BER are settled
in favor of the faster algorithm.

We now want to consider making $\beta$ a function of the iteration index $t$. A parallel can be
made with Gear-Shift decoding [9], which considers how several decoding rules can be used in
sequence to optimize some performance metric (such as the maximum or the average number of
iterations) while achieving a given target BER. To find the optimal sequence of decoding rules,
[9] assumes that these rules have the following two properties. First, that the messages sent from
variable nodes to check nodes can be described by a one-parameter probability density function
(PDF). Let $c(t)$ be the PDF parameter at iteration $t$. The second necessary property is that
$c(t) = f(c(t-1), c_0)$, where $c_0$ is the parameter of the channel output PDF. Unfortunately, the second
property does not hold for BP algorithms with memory. For example, in the case of RHS, the
variable node input message $p_i(t)$ in (11) depends on all received messages $\{\hat{m}_i(1), \ldots, \hat{m}_i(t)\}$,
which becomes clear when unrolling the recursion.

Our goal is to jointly optimize the BER and the average number of iterations. As a design
procedure, we propose to initially assume that the second property holds, i.e., that the algorithm
is memoryless. In this case, the best $\beta(t)$ is simply the one that minimizes the BER at iteration $t$.
Therefore the $\beta$-sequence can be read off the transfer function plot. Note that for a memoryless
algorithm, the same sequence minimizes both the BER and the average number of iterations.
Since in reality the second property does not hold, this sequence is only used as a starting point.
Following [9], we will write the sequences as $\{\beta_1^a, \beta_2^b, \ldots\}$ to mean that $\beta_1$ is used for the first
$a$ iterations, followed by $\beta_2$ for the next $b$ iterations, and so on. If we constrain the sequence to
be of the form $\{\beta_1^l, \beta_2^{L-l}\}$, only the parameter $l$ is left to be found, and if necessary it can be
adjusted to trade off BER for throughput. For example, Fig. 2b shows the BER transfer curves
for various values of $\beta$. From the plot we see that $\beta = 0.5$ is the best choice up to a BER-in of
$3 \cdot 10^{-4}$, while for lower BER-in values, $\beta = 0.25$ is superior. A BER of $3 \cdot 10^{-4}$ is achieved
in 5 iterations with $\beta = 0.5$, therefore the best sequence is $\{0.5^5, 0.25^{L-5}\}$. The actual BER
performance of this sequence is shown in Fig. 2a. Compared with the $\beta = 0.25$ curve, the BER
at 50 iterations is slightly degraded, but the average number of iterations is reduced by 33%.
Note that for $L < 50$ (and most likely for any $L$), there is no length-one $\beta$-sequence that will
yield this combination of BER and average number of iterations.
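
The sequence notation can be made concrete with a small helper that expands a sequence into one $\beta$ value per iteration. The representation as `(beta, count)` pairs, with `None` meaning "until iteration $L$", is an assumption made for this sketch.

```python
def expand_beta_sequence(segments, max_iters):
    """Expand [(beta, count), ...] into a per-iteration list of length <= max_iters.

    A count of None means "use this beta for all remaining iterations".
    """
    schedule = []
    for beta, count in segments:
        remaining = max_iters - len(schedule)
        n = remaining if count is None else min(count, remaining)
        schedule.extend([beta] * n)
    return schedule
```

For instance, the sequence $\{0.5^5, 0.25^{L-5}\}$ with $L = 50$ is written `expand_beta_sequence([(0.5, 5), (0.25, None)], 50)`, and the decoder would use `schedule[t - 1]` as the tracker constant at iteration `t`.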

E. Lowering the Error Floor 

A convenient way to improve the error floor at the level of the decoding algorithm is to
consider a two-phase approach. In the first decoding phase, the normal algorithm is used. If
there are unsatisfied check node constraints at the end of the first phase, the decoding is known
to have failed and a modified algorithm is used to attempt to resolve the failure. If the two
algorithms are similar, the cost in terms of circuit area is kept to a minimum. The two-phase
approach has been widely used to improve error-floor performance (e.g., [15]-[17]).

The RHS algorithm should readily integrate with most Phase-II algorithms developed for
SPA decoders since the RHS variable node operations are based on the SPA. To illustrate this
capability, we will introduce a Phase-II algorithm named "VN Harmonization" that can be used
both for RHS and for SPA or Min-Sum decoders. Error rate results for this algorithm will be
presented in Section V. VN Harmonization has some similarities with the Phase-II algorithm
presented in [16], but has been found to be successful on certain codes for which the algorithm
of [16] is ineffective, such as the (2640,1320) Margulis code [18]. It also has the advantage
that it operates locally in each variable node and requires no communication between processing
nodes.

For each variable node and each iteration $t$, the VN Harmonization algorithm performs a
modification on the set $I(t) = \{\Lambda_1(t), \ldots, \Lambda_{d_v}(t)\}$ of LLR-domain VN inputs. When used with
the RHS algorithm, these LLR inputs correspond to the LLR-domain input trackers, which will
be introduced in Section IV-B. We partition $I(t)$ into $I_+(t) = \{\Lambda \in I(t) : \Lambda > 0\}$, and
$I_-(t) = I(t) \setminus I_+(t)$. We then define a majority set $M(t)$ as corresponding to the set with the
largest number of elements among $I_+(t)$ and $I_-(t)$, with $M(t) = \emptyset$ if $|I_+(t)| = |I_-(t)|$, where
$|S|$ denotes the cardinality of set $S$. The algorithm is described by Algorithm 2, where $j$ is such
that $M(t) = \{\Lambda_j(t)\}$ when $|M(t)| = 1$, and $d > 0$ is a constant that must be found empirically.
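
The construction of the majority set can be sketched as below. Only the partitioning step described in this paragraph is shown; the modification that Algorithm 2 applies to the inputs is not reproduced here, and the function name is hypothetical.

```python
def majority_set(llrs):
    """Return the larger of I+ (strictly positive LLRs) and I- (the rest).

    An empty list is returned when both parts have the same size.
    """
    positive = [x for x in llrs if x > 0]
    negative = [x for x in llrs if x <= 0]
    if len(positive) == len(negative):
        return []
    return positive if len(positive) > len(negative) else negative
```

Because the function only looks at the inputs of one variable node, it reflects the locality property mentioned above: no information from other processing nodes is needed.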

IV. RHS Decoder Implementation 

We will now present an efficient implementation for the mechanisms introduced in the previous
section. By design, the RHS algorithm minimizes the wiring complexity of the decoder by using
only one wire to transmit a given message. When $k > 1$, the message bits are transmitted serially.
In addition, because of the properties of the check node function discussed in Section III-C, the
topology of the circuit graph can be modified to simplify wire routing. Interestingly, if a check
node is fully partitioned, the associated logic can be integrated in the neighboring variable node
blocks, eliminating the check nodes in the circuit graph. The circuit diagram for such a structure
is shown in Fig. 3. The degree of partitioning can be chosen individually for each check
node, since it does not affect the RHS algorithm. Since the check node function is very simple,
the rest of this section is devoted to the variable node function. The LLR domain is attractive
for implementing the VN function because (1) is much simpler than the equivalent probability
domain function. We now present an approach for implementing the RHS VN function in the
LLR domain with low complexity.



A. Variable Node Output 

For each edge $i$ of the variable node, $1 \le i \le d_v$, a sequence of random message bits
$\{X_{i,1}, \ldots, X_{i,k}\}$ must be generated. This can be achieved by using (13), but we would now like
to work with LLR values instead of $p'_i$. As a result, the thresholds $T_j$ must be generated in
the LLR domain. To generate LLR thresholds with low complexity, we approximate the natural
logarithm with a base-2 logarithm, which can be generated using a simple circuit known as a
priority encoder. A priority encoder takes as input a binary sequence $Z$ and outputs the number
$W$ of "zero" elements preceding the first "one" element in the sequence. If $Z = \{Z_1, \ldots, Z_q\}$ is
generated by a sequence of independent Bernoulli experiments such that $\Pr(Z_i = 1) = \psi_i$, the
output $W \in \{0, 1, \ldots, q\}$ has the following probability mass function:

$$\Pr(W = w) = \begin{cases} \psi_{w+1} \prod_{i=1}^{w} (1 - \psi_i), & 0 \le w < q, \\ \prod_{i=1}^{q} (1 - \psi_i), & w = q, \\ 0, & w > q. \end{cases} \tag{15}$$

The priority encoder only generates positive numbers, but since the LLR threshold distribution
is symmetric, the sign bit can be generated separately using a fair random bit $S$. When quantized
to integer values, the LLR threshold is expressed as $T_j = (-1)^S W$, for $W < q$. $\Pr(W = q)$ is a
special case that is linearly related to $\Pr(W = q - 1)$, and therefore the largest LLR magnitude
that can be generated is $|T_j| = q - 1$. When $W = q$, we can let $|T_j|$ take a value that is otherwise
underrepresented by our approximation, e.g. let $T_j = (-1)^S \cdot 2$ if $W = q$. We now have to find
$\{\psi_1, \psi_2, \ldots\}$ that best approximate the true LLR threshold magnitude distribution. If we consider
again the case of an integer quantization, the probabilities $\psi_1 = 1/4$ and $\psi_2 = \psi_3 = \ldots = \psi_q = 1/2$
provide a good approximation when $q$ is not too large. In a circuit implementation, sequences of
fair pseudo-random bits can be generated easily, for example using linear-feedback shift-register
circuits. $Z_1$ can be generated by combining two fair random bits, while $Z_2, \ldots, Z_q$ only require
one fair random bit each.
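
A software model of this threshold generator is sketched below. It assumes $\psi_1 = 1/4$ (modeled here as the AND of two fair bits, which is one possible way of combining them) and $\psi_2 = \ldots = \psi_q = 1/2$, and it uses the remapping of $W = q$ to magnitude 2 given as an example in the text. The function names and the use of Python's `random.Random` in place of a linear-feedback shift register are illustrative.

```python
import random


def priority_encoder(z):
    """Number of zeros preceding the first one; len(z) if the sequence has no one."""
    for w, bit in enumerate(z):
        if bit:
            return w
    return len(z)


def pmf_w(psi):
    """Eq. (15): list of Pr(W = w) for w = 0, ..., q, given psi_1, ..., psi_q."""
    pmf = []
    survive = 1.0                    # probability that all earlier Z_i were zero
    for p in psi:
        pmf.append(p * survive)
        survive *= 1.0 - p
    pmf.append(survive)              # the W = q case
    return pmf


def llr_threshold(q, rng):
    """One integer LLR threshold T_j (assumed psi_1 = 1/4, others 1/2)."""
    z1 = rng.getrandbits(1) & rng.getrandbits(1)   # AND of two fair bits: Pr = 1/4
    z = [z1] + [rng.getrandbits(1) for _ in range(q - 1)]
    w = priority_encoder(z)
    magnitude = 2 if w == q else w                 # remap W = q, as in the text's example
    sign = -1 if rng.getrandbits(1) else 1         # fair sign bit S
    return sign * magnitude
```

With `q = 3`, `pmf_w([0.25, 0.5, 0.5])` gives the probabilities of `W = 0, 1, 2, 3`, and the last entry is the mass that `llr_threshold` reassigns to magnitude 2.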

When deriving the stochastic check node function (6), we assumed that all binary messages
entering a check node are statistically independent. Furthermore, in (10), we expect $\{Y_1, \ldots, Y_k\}$
to be independent, and therefore, in (13), the sequence $\{T_1, \ldots, T_k\}$ must be independent. To
achieve this, the number $r$ of independent random numbers required in the entire decoder
for one iteration is at most $r = Nk$, where $N$ is the number of variable nodes.
However, by relaxing the requirement for independence, the decoder can function with far fewer
random numbers. For the codes considered in this paper, simulations have shown that a random
number generator can be shared among 64 VNs without any degradation in performance, that
is $r = Nk/64$. As a result, the circuit implementation can contain only a few Random Number
Generation modules, and the circuit area occupied by these modules is expected to be negligible.

In practice, the LLR values are represented on a finite range, and we must consider the
impact this has on the operation of the decoder. Let this range be $[-\Lambda_{cap}, \Lambda_{cap}]$, with $\Lambda_{cap} > 0$.
If $|\Lambda_i| < \Lambda_{cap}$, the finite range has no impact. Therefore let $s$ be the number of check node
inputs for which $|\Lambda_i| \ge \Lambda_{cap}$, $0 \le s < d_c$. We can show that this approximately has the effect of
changing the mean of $Y_i$, defined in (8), such that

$$E[Y_i] = \phi \left( m_i - \tfrac{1}{2} \right) + \tfrac{1}{2}, \tag{16}$$

where $\phi$ is a scaling factor that depends on $\Lambda_{cap}$ and on $s$:

$$\phi = \left( 1 - \frac{2}{e^{\Lambda_{cap}} + 1} \right)^{s}. \tag{17}$$

A consequence of (16) is that, when $s > 0$, the message estimator is no longer unbiased as
defined in (10), and this has an impact on the error rate performance. To have $E[\hat{m}] = m$, we
must replace it with

$$\hat{m}_i = \frac{1}{\phi} \left( \frac{1}{k} \sum_{j=1}^{k} Y_{i,j} - \frac{1}{2} \right) + \frac{1}{2}. \tag{18}$$

This in turn changes the message set $\mathcal{M}$, which is no longer a subset of [0, 1]. Similarly the
codomain of the tracking function (11) would no longer be [0, 1] and it must be re-defined with the
appropriate saturation. If we assume that $\phi > 1 - 2/k$, we have $0 < \mu_n < 1$ for $n \in \{1, \ldots, k-1\}$,
and we can define the new tracking function as

$$p_i(t) = \begin{cases} 0, & \text{if } \hat{m}_i(t) = \mu_0 \text{ and } p_i(t-1) < L, \\ 1, & \text{if } \hat{m}_i(t) = \mu_k \text{ and } p_i(t-1) > H, \\ (1 - \beta)\, p_i(t-1) + \beta\, \hat{m}_i(t), & \text{otherwise,} \end{cases} \tag{19}$$

where $L = \frac{-\beta \mu_0}{1 - \beta}$ and $H = 1 - L$.
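
The effect of the finite LLR range can be explored with the sketch below. It rests on assumptions that should be checked against the paper's equations: the scaling factor is taken to be $\phi = (1 - 2/(e^{\Lambda_{cap}} + 1))^s$, the corrected estimator inverts the affine relation (16), and the saturating tracker is modeled as the linear update clipped to [0, 1], which is one way to realize the saturation described above. All function names are hypothetical.

```python
import math


def phi(llr_cap, s):
    """Assumed scaling factor: s saturated inputs, each shrinking the mean toward 1/2."""
    return (1.0 - 2.0 / (math.exp(llr_cap) + 1.0)) ** s


def corrected_estimate(sample_mean, scale):
    """Invert E[Y] = scale * (m - 1/2) + 1/2 so that the estimate of m is unbiased."""
    return (sample_mean - 0.5) / scale + 0.5


def track_saturating(p_prev, m_hat, beta):
    """Linear tracker update clipped to the interval [0, 1]."""
    return min(1.0, max(0.0, (1.0 - beta) * p_prev + beta * m_hat))
```

Because `corrected_estimate` can return values below 0 or above 1 at the two ends of the message set, the clipping in `track_saturating` is what keeps the tracked value a valid probability.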




Ideally, the parameter $s$ in (17) would be set to the expected number of saturated check
node inputs, which depends on the SNR and on the iteration index. However, having to set $s$
dynamically would make the decoder too complex, and we will resort to simple heuristic rules.
At high SNR, a large portion of messages in a BP decoder become saturated¹. Therefore, it seems
reasonable to use $s = d_c - 1$. Furthermore, when the $d_c$ values are small and $\Lambda_{cap}$ is not too large,
simulation results presented in Section V show that such an $s$ can be used at all SNRs with
almost no performance degradation. However, when the $d_c$ values are large, simulation results
show significant variations in performance between $s = 0$ and $s = d_c - 1$, and a choice must be
made based on the application.

B. Variable Node Input 

At the input of the variable node circuit, the values required to evaluate the variable node update
are estimated from the message stream using (11) (or (19)). However, since we choose to perform the
variable node computations in the LLR domain, the estimate would need to be converted from the
probability domain to LLR. There is a more interesting alternative, which is to design a tracking
mechanism that operates directly in the LLR domain. We will first consider how to achieve this when
the tracking function is (11), and will later comment on how the mechanism should be modified if
(19) is used instead. Equation (11) in the LLR domain becomes
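A form consistent with the rest of this section can be sketched as follows. It assumes that (11) is the relaxation update $p(t) = (1-\beta)\,p(t-1) + \beta\,\hat{m}(t)$ and that the LLR convention is $\Lambda = \ln\frac{1-p}{p}$, so that $p = 1/(1 + e^{\Lambda})$. Both assumptions are inferred from the saturation thresholds and the ranges quoted in the example below, not taken from the original expression.

```latex
\Lambda(t)
= \ln\frac{1 - (1-\beta)\,p(t-1) - \beta\,\hat{m}(t)}{(1-\beta)\,p(t-1) + \beta\,\hat{m}(t)}
= \ln\frac{\bigl(1 - \beta\hat{m}(t)\bigr)\,e^{\Lambda(t-1)} + \beta\bigl(1 - \hat{m}(t)\bigr)}
          {\beta\hat{m}(t)\,e^{\Lambda(t-1)} + 1 - \beta\bigl(1 - \hat{m}(t)\bigr)}
```

For $\hat{m} = 0$ this reduces to $\ln\frac{e^{\Lambda(t-1)} + \beta}{1 - \beta}$, which is bounded below by $\ln\frac{\beta}{1-\beta} \approx -1.73$ when $\beta = 0.15$.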



(20)

By fixing the message $\hat{m} \in \mathcal{M}$, we obtain a transfer function that describes the
tracker update, which we denote $f(\Lambda; \hat{m})$. The tracker update is then expressed as
$\Lambda(t) = f(\Lambda(t-1); \hat{m})$. With the message estimator given by (10) or (18), the transfer
functions have the following symmetry property: $f(\Lambda; \mu_j) = -f(-\Lambda; \mu_{k-j})$, for
$\frac{k}{2} < j \leq k$. An example is shown in Fig. 4. Therefore, the number of transfer functions
we need to consider is $\lfloor k/2 \rfloor + 1$. The remaining functions are simply obtained (and
implemented) using the symmetry property.

¹See QU and [20] for some observations on message saturation in LDPC decoding.
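The symmetry property can be checked numerically with a short sketch. The closed form used for the transfer function assumes the relaxation update $p(t) = (1-\beta)p(t-1) + \beta\hat{m}$ and the LLR convention $\Lambda = \ln\frac{1-p}{p}$; both are assumptions, and the function name `transfer` is hypothetical. The check is restricted to the unbiased message set $\mu_j = j/k$ (the $s = 0$ case), for which $\mu_{k-j} = 1 - \mu_j$.

```python
import math

def transfer(llr_prev, m_hat, beta):
    """LLR-domain tracker update f(Lambda; m_hat) under the assumptions in the lead-in."""
    e = math.exp(llr_prev)
    num = (1.0 - beta * m_hat) * e + beta * (1.0 - m_hat)   # proportional to 1 - p(t)
    den = beta * m_hat * e + 1.0 - beta * (1.0 - m_hat)     # proportional to p(t)
    return math.log(num / den)

def max_symmetry_error(k, beta, points):
    """Largest |f(x; mu_j) + f(-x; mu_{k-j})| over all j and the given LLR points."""
    worst = 0.0
    for j in range(k + 1):
        for x in points:
            err = transfer(x, j / k, beta) + transfer(-x, (k - j) / k, beta)
            worst = max(worst, abs(err))
    return worst

grid = [-6.0 + 0.5 * i for i in range(25)]   # LLR values from -6 to 6
print(max_symmetry_error(4, 0.15, grid))      # rounding-level residual
```

Because of this symmetry, only the functions for $j \leq \lfloor k/2 \rfloor$ need to be tabulated or approximated; the others follow by negating the input and the output.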

The transfer functions are non-linear in the LLR domain, but we need the tracking circuits
to have a low complexity. Fortunately, the transfer functions can suitably be approximated by a
linear function, combined with saturation functions at either or both ends of the linear domain.
For each value of $\hat{m}$, the steps for deriving the simplified transfer functions are as follows:

1) Determine the image $\mathcal{A}$ of $f(\Lambda; \hat{m})$. When $\mathcal{A}$ is open-ended, that is
for some $c \in \mathbb{R}$, $\mathcal{A} = [-c, \infty)$ or $\mathcal{A} = (-\infty, c]$, we restrict it
to a finite interval $\mathcal{A} = [-c, \Lambda_L]$ or $\mathcal{A} = [-\Lambda_L, c]$. $\Lambda_L > 0$
is the maximum absolute value that must be represented in the trackers. It depends
on the code structure and is the same as the maximum value that must be represented in
an SPA decoder.

2) For $\Lambda \in \mathcal{A}$, find the optimal linear approximation $a\Lambda + b$ to
$f(\Lambda; \hat{m})$.

3) To simplify the circuit implementation, we want $a$ and $b$ to have binary representations
that are as compact as possible, and therefore the constants are rounded according to this
criterion. When possible we prefer $a = 1$. This step requires some simulation of the decoder
to determine how much the constants can be rounded.
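The three steps above can be sketched in a few lines. The transfer function below assumes the relaxation update $p(t) = (1-\beta)p(t-1) + \beta\hat{m}$ with $\Lambda = \ln\frac{1-p}{p}$, the fit is an ordinary least-squares fit on a uniform grid, and rounding to a grid of $1/4$ is an illustrative choice; the text determines the amount of rounding by simulation. All function names are hypothetical.

```python
import math

def transfer(llr_prev, m_hat, beta):
    """Assumed LLR-domain tracker update f(Lambda; m_hat)."""
    e = math.exp(llr_prev)
    num = (1.0 - beta * m_hat) * e + beta * (1.0 - m_hat)
    den = beta * m_hat * e + 1.0 - beta * (1.0 - m_hat)
    return math.log(num / den)

def grid(lo, hi, n=2001):
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

def fit_offset(f, lo, hi):
    """Step 2 with a fixed to 1: the least-squares b is the mean of f(x) - x."""
    xs = grid(lo, hi)
    return sum(f(x) - x for x in xs) / len(xs)

def fit_line(f, lo, hi):
    """Step 2 in general: least-squares slope a and intercept b."""
    xs = grid(lo, hi)
    ys = [f(x) for x in xs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def round_to_grid(value, step=0.25):
    """Step 3 (illustrative): round a constant to the nearest multiple of step."""
    return round(value / step) * step

beta = 0.15
lower = math.log(beta / (1.0 - beta))           # step 1: image of f(.; 0) is [lower, inf)
b0 = fit_offset(lambda x: transfer(x, 0.0, beta), lower, 15.0)
a1, b1 = fit_line(lambda x: transfer(x, 0.5, beta), -2.5, 2.5)
print(round_to_grid(b0), round_to_grid(a1))
```

Under these assumptions the fitted constants come out close to the values $b = 0.206$ and $a = 0.776$ quoted in Example 1 below, and rounding to the quarter grid gives $\frac{1}{4}$ and $\frac{3}{4}$.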

Example 1: Fig. 4 shows the transfer functions of the LLR-domain estimator for the case of
2-bit messages (i.e., $k = 2$ in (10)) and $\beta = 0.15$. In this case, the possible message
estimates are $\mathcal{M} = \{0, \frac{1}{2}, 1\}$. For $f(\Lambda; 0)$, $\mathcal{A} = [-1.73, \Lambda_L)$.
We notice that the slope of $f(\Lambda; 0)$ on this range is close to 1, therefore we look for an
approximation of the form $f(\Lambda; 0) = \Lambda + b$. With $\Lambda_L = 15$, the mean squared error
is minimized by $b = 0.206$, which we round to $b = \frac{1}{4}$. $\mathcal{A}$ is rounded to
$[-\frac{7}{4}, 15]$. $f(\Lambda; 1)$ is obtained by symmetry, and for $\hat{m} = \frac{1}{2}$, we get
$f(\Lambda; \frac{1}{2}) = 0.776\Lambda$ on the domain $[-2.5, 2.5]$, which we round to
$f(\Lambda; \frac{1}{2}) = \frac{3}{4}\Lambda$.

Ultimately, the tracking circuit can be very simple. In the example above, the tracker value
can be represented on 7 bits, and the only operations that are required are addition with $\pm 1$
(representing $\pm\frac{1}{4}$), and multiplication by $\frac{3}{4}$, which can be implemented as the
addition of two shifted values.
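A fixed-point version of the tracker in Example 1 might look like the following sketch. It assumes that the tracker is stored as a signed integer in units of $\frac{1}{4}$, that the output of each function is saturated to its rounded range, and that the product with $\frac{3}{4}$ is formed from two arithmetic right shifts, which truncate toward negative infinity. These choices and all names are illustrative assumptions, not details given in the text.

```python
Q = 4               # assumed resolution: 4 integer steps per LLR unit (step of 1/4)
LLR_MAX = 15 * Q    # Lambda_L = 15 in quarter units
EDGE = 7            # 7/4, the rounded bound of the image of f(.; 0)
MID = 10            # 2.5, the bound used for f(.; 1/2)

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def tracker_step(x, n):
    """x: tracker value in quarter units; n in {0, 1, 2} indexes the estimate {0, 1/2, 1}."""
    if n == 0:                               # f(L; 0) = L + 1/4
        return clamp(x + 1, -EDGE, LLR_MAX)
    if n == 2:                               # f(L; 1) = L - 1/4, by symmetry
        return clamp(x - 1, -LLR_MAX, EDGE)
    three_quarters = (x >> 1) + (x >> 2)     # x/2 + x/4 with truncating shifts
    return clamp(three_quarters, -MID, MID)

x = 0
for n in (0, 0, 0, 1, 2):                    # a short, arbitrary message sequence
    x = tracker_step(x, n)
```

All reachable values lie in $[-60, 60]$, which fits in a signed 7-bit word, in line with the 7 bits mentioned above. The truncating shifts make the $\frac{3}{4}$ branch slightly asymmetric for negative inputs; a circuit could instead operate on the magnitude if exact odd symmetry is wanted.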

LEDUC-PRIMEAU et al.: RELAXED HALF-STOCHASTIC BELIEF PROPAGATION (DRAFT, March 3, 2013)

When using $s > 0$, the transfer functions are similar, except that $f(\Lambda; \mu_0)$ goes to
infinity at the LLR value corresponding to $p(t-1) = L$, and similarly $f(\Lambda; \mu_k)$ goes to
negative infinity at the LLR value corresponding to $p(t-1) = H$, as shown in Fig. 4. The maximum
value to be represented in the tracker will therefore be set to $\Lambda_L = \ln\frac{1-L}{L}$
($\Lambda_L = 6.29$ in our example). However, for $f(\Lambda; \mu_0)$ and $f(\Lambda; \mu_k)$, using the
range $\mathcal{A}$ described above as the linear approximation domain results in a poor approximation,
since the functions are highly non-linear near $\Lambda_L$ and $-\Lambda_L$, respectively. The domain of
the linear approximation must be reduced slightly in order to obtain a good fit. If we consider the
quantized representation of $\Lambda(t)$, the "infinity" values can be handled by simply assigning a
special meaning to the largest positive and negative values. We then re-define the variable node
update to take into account this special meaning. We first define a saturation indicator $S_i$ as
$S_i = 1$ if $\Lambda_i = \Lambda_L$, $S_i = -1$ if $\Lambda_i = -\Lambda_L$, and $S_i = 0$ if
$|\Lambda_i| < \Lambda_L$. When $\sum_{j \neq i} S_j = 0$, the output $\Lambda'_i$ is given by the
variable node update as usual. Otherwise, if $\sum_{j \neq i} S_j > 0$, we set
$\Lambda'_i = \Lambda_{cap}$, and if $\sum_{j \neq i} S_j < 0$, $\Lambda'_i = -\Lambda_{cap}$.
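The saturation rule can be illustrated with a short sketch. It assumes that the usual variable node update is the sum of the channel prior and the other tracker values, clipped to $[-\Lambda_{cap}, \Lambda_{cap}]$, and that the sums of indicators run over all inputs except the one on edge $i$. The function names and the example values are hypothetical.

```python
def saturation_indicator(llr, llr_max):
    """S = +1 or -1 when the tracker holds its special 'infinity' value, else 0."""
    if llr >= llr_max:
        return 1
    if llr <= -llr_max:
        return -1
    return 0

def vn_output(prior, trackers, i, llr_max, llr_cap):
    """Extrinsic output for edge i, overridden when the other inputs are saturated."""
    others = [x for j, x in enumerate(trackers) if j != i]
    s_sum = sum(saturation_indicator(x, llr_max) for x in others)
    if s_sum > 0:
        return llr_cap
    if s_sum < 0:
        return -llr_cap
    total = prior + sum(others)              # assumed form of the usual update
    return max(-llr_cap, min(llr_cap, total))

# Edge 1 sees one positively saturated input among the others, so the output is +llr_cap.
print(vn_output(0.5, [6.25, 1.0, -2.0], 1, 6.25, 6.0))   # 6.0
```

When positive and negative saturated inputs cancel, the indicator sum is zero and the ordinary update is used, in which case the two special values also cancel in the sum.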

The proposed linear approximation of the LLR-domain estimator can also support an efficient
implementation of the $\beta$-sequences introduced in Sect. III-D. In the case of the example above,
using multiple values for $b$ was found to provide a throughput advantage comparable to using
$\beta$-sequences in a decoder that uses (11) directly.

V. Simulation Results 

To test the performance of the RHS algorithm we consider two standardized codes. The first is
taken from the IEEE 802.3an standard. The code structure is based on a shortened Reed-Solomon
code (RS-LDPC) [21]. The code has length 2048, rate 0.8413, and is regular with $d_v = 6$ and
$d_c = 32$. The second code has been standardized in [22]. It has length 2048, with an additional
512 punctured variable nodes, rate 1/2, and $d_v \in \{1, 2, 3, 6\}$, $d_c \in \{3, 6\}$. The code
design is known as Accumulate-Repeat-4-Jagged-Accumulate (AR4JA) [23].

The performance of the various decoding algorithms is measured using software Monte-Carlo
simulations executed on a parallel computing platform. Following the discussion in Sections III
and IV, we will present results for various levels of idealization of the variable node tracking
functions. We will refer to the use of (11) as "floating-point (FP) probability tracking". The
second step towards implementation is to use linearized LLR-domain tracking (as described in
Section IV-B) with full-precision parameters, referred to as "optimal linear tracking". Note that
even in this case, we constrain the $\mu_0$ and $\mu_k$ solutions to be of the form
$\Lambda(t) = \Lambda(t-1) + b$, such that the corresponding circuit is simply an adder. The last
step is to round the linear tracking parameters to reduce the circuit implementation complexity.
This is referred to as "rounded linear tracking". For the IEEE 802.3an code, we went further and
implemented the trackers with integer data types and quantized channel outputs, to mimic the
operation of a circuit implementation.

An RHS decoder has two design parameters. First, the iteration limit $L$ should be chosen
based on the latency requirement. The other parameter is the number $k$ of bits per message.
Figure 5 shows the effect of $k$ on the frame error rate for the two codes simulated. A larger
$k$ improves the error rate for a given $L$. Since the number of pre-determined linear operations
that must be supported in the tracker circuit is $\lfloor k/2 \rfloor + 1$, the implementation
complexity will generally grow with $k$. However, $\beta^*$ and the corresponding $b$ are increasing
functions of $k$, and a larger $b$ can have the effect of decreasing the number of quantization bits
required in the LLR tracker. Since message bits are transmitted serially, a larger $k$ also increases
the circuit latency associated with message transmission, but for small values of $k$, the variable
node circuit latency is expected to be dominant. Parameter $\beta$ is optimized as described in
Section III-D with BER as the optimization target (unless mentioned otherwise). The variable node
output LLR range, $\Lambda_{cap}$, is chosen as small as possible. We use $\Lambda_{cap} = 8$ for the
IEEE 802.3an code, and $\Lambda_{cap} = 6$ for the AR4JA code. We will first discuss in Section V-A
the maximum BER performance of the RHS algorithm, that is, the BER when $L$ is chosen such that any
increase provides a negligible BER improvement. Then, in Section V-B we consider how RHS performs
when the focus is on decoding latency and throughput.

A. Error Correction Capability 

The BER results are shown in Fig. 6a and 6b for the RS-LDPC code, and in Fig. 7 for the
AR4JA code. We show the error rate achieved by floating-point SPA and NMS as a reference. For
NMS, the normalization parameter $\alpha$ is set to $\alpha = 0.5$ for the RS-LDPC code, and
$\alpha = 0.75$ for the AR4JA code. As seen in the figures, RHS can match the performance of FP SPA,
and even outperforms it by a significant margin on the RS-LDPC code. This superior performance on
the RS-LDPC code can be attributed to the successive relaxation iterative dynamics that are a
consequence of (11). In fact, we have verified that the RHS curve with 1K iterations and FP
probability tracking shown in Fig. 6a exactly matches the BER of Successive Relaxation SPA [7].

Both codes considered are affected by message saturation effects at low error rates, which can
cause error floors. For the RS-LDPC code, the connection between the error floor and message
saturation is well documented in [20]. For the AR4JA code, we have observed that when decoding
with FP NMS, enforcing a limit on LLR messages causes a floor. For example, when limiting
LLR values to the range $[-16, 16]$, the BER of FP NMS never goes below $10^{-9}$. Without the
saturation limit, no floor is observed. Because of the message saturation effects, for the AR4JA
code we use (19) as the basis for variable node message tracking, with $s = d_c - 1$ in (17). In the
waterfall SNR region, we have observed little difference in BER between $s = 0$ and $s = d_c - 1$,
which motivates the use of the latter. We can see in Fig. 7 that no floor is observed on either FP
NMS or RHS. For the RS-LDPC code, RHS has an error floor that is comparable with quantized
NMS. On this code, we have a solution available to address floor performance that has low
complexity and is very effective, namely the VN Harmonization algorithm that was introduced
in Section III-E. This solution is therefore preferred over the use of $s > 0$, especially since we
have observed that for the RS-LDPC code, using $s > 0$ degrades the BER in the waterfall region.
We present some curves that use a decoding Phase-II with VN Harmonization at specific SNRs,
denoted by (*), (**) and (***) in Fig. 6a and 6b. The parameter $d$ in Alg. 2 is set to 0.3. For
the other RHS curves in Fig. 6a, no Phase-II is used because the resulting BER would require
too much computing time to be simulated.

As expected, the best BER performance is obtained when using (11) implemented with floating-point
operations. We first want to observe the impact of the linear approximation to LLR-domain
tracking. We can see that for both codes, the "optimal linear tracking" curves are close to FP
probability tracking. We then consider the BER performance when low-complexity parameters
are used for the LLR-domain tracking. To give an example of how we can expect the algorithm
to perform once implemented in hardware, we show a simulation of the RS-LDPC code where
the software implementation uses integer data types and the channel outputs are quantized on
4 bits. The specific parameters used were presented in the example of Section IV-B. The BER
achieved by this simulation ("rounded linear tracking") shows approximately a 0.1 dB gain over
NMS with 4-bit inputs, similar to the difference observed between "optimal linear tracking" RHS
and FP NMS. On the AR4JA code, the "rounded linear tracking" curve uses the same software
implementation as "optimal linear tracking", but with rounded parameters. For the AR4JA code,
the BER curves are shown for $k = 4$, and therefore 3 different linear calculations must be
implemented in the tracker. Since we choose $s = d_c - 1$ in (17), $f(\Lambda; \hat{m})$ will depend on
the degree $d_c$ of the check node generating the message $\hat{m}$. The AR4JA code has
$d_c \in \{3, 6\}$. However, after rounding the parameters, the transfer functions are the same for
both check node degrees, with the exception that $\Lambda_L = 6.25$ for $d_c = 3$, and
$\Lambda_L = 5.5$ for $d_c = 6$. The transfer functions used for the "rounded linear tracking" result
are as follows: f(Λ; μ_0) = Λ + |, −1 < Λ < Λ_L; f(Λ; μ_1) = |Λ + |, −2 < Λ < f; f(Λ; μ_2) = |Λ,
−2 < Λ < 2. The functions are saturated outside the ranges specified. As was the case for the
RS-LDPC code, these tracking functions have a low implementation complexity. Furthermore,
$\Lambda(t)$ can be represented on only 6 bits.

B. Throughput and Latency 

If we consider the iteration limit required for RHS to achieve the same performance as
Normalized Min-Sum, we expect a higher number of iterations to be needed for RHS, because
more information is exchanged in one iteration of NMS. The relationship between a number
of iterations and actual time depends on the circuit latency associated with one iteration. This
in turn depends on the amount of sequential logic involved in computing the messages, and on
the wire delays, which often cannot be neglected in fully-parallel implementations. Furthermore,
the metric of interest is usually not simply the latency, but the latency normalized to circuit
area. Because of its very simple check node update operation, RHS has less sequential logic per
iteration. Furthermore, its ability to be partitioned arbitrarily improves the wire delays and area
requirements of the circuit implementation. With NMS, check nodes cannot be easily partitioned
to improve the circuit implementation efficiency. An approach for doing so is proposed in [4],
but it has the effect of degrading the decoding performance. The curve taken from [4] appearing
in Fig. 6b shows the BER when the check nodes are split into 16 partitions, with a limit of
11 iterations and 5 bits per message. The figure shows that better performance is achieved by
RHS with a limit of 30 iterations, that is, with a comparable number of bits being exchanged.
Additionally, the BER performance is not tied to check node partitioning. For example, the
implementation can feature fully-partitioned check nodes (layout as in Fig. 3) while achieving a
BER that is about three orders of magnitude better than the "Split-Row-16" curve, if a latency
of 1000 iterations can be tolerated.

As presented in Section III-D, the average number of iterations required for convergence can be
reduced by using multiple $\beta$ values in sequence. For the RS-LDPC code, we found that despite
exchanging only 2-bit messages, and despite the simple check nodes, the average number of iterations
can be reduced to 3.46 (at 4.6 dB) by using $\beta = \{0.5^{5}, 0.25^{L-5}\}$. The RHS curve with
$L = 100$ iterations in Fig. 6b uses this $\beta$ sequence. In comparison, the NMS algorithm requires
an average of 2.5 iterations at 4.6 dB, and therefore RHS requires only 38% more iterations. Using
the same code, the authors of [4] have reported a 3.3× improvement in throughput from the use of
check node partitioning. In addition, RHS can exchange fewer bits per iteration, and the check node
circuits are much simpler. Therefore, in a fully-parallel circuit implementation, we expect the
RHS algorithm to provide a higher nominal throughput (without considering area requirements)
than NMS on this code. Considering throughput normalized to circuit area should further increase
the advantage of RHS. On the AR4JA code, we have observed that using
$\beta = \{0.5^{26}, 0.25^{L-26}\}$ provides a 28% throughput improvement (at 2.5 dB) over
$\beta = 0.25$. However, the average number of iterations remains 2.6× higher than for NMS.
Therefore, RHS might not provide a nominal throughput advantage on this code. However, because of
the reduction in area provided by binary message passing and check node partitioning, RHS could
still have an advantage in terms of throughput/area, although this would need to be confirmed at
the implementation level.
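The $\beta$-sequence notation can be made concrete with a small sketch. It assumes that $\{0.5^{5}, 0.25^{L-5}\}$ means $\beta = 0.5$ for the first 5 iterations and $\beta = 0.25$ for the remaining $L - 5$; this reading is an assumption, since the definition is given in Section III-D, and the function name is hypothetical.

```python
def beta_schedule(segments, final_beta, max_iters):
    """Expand (beta, count) segments into one beta value per iteration.

    Iterations not covered by the segments use final_beta.
    """
    betas = []
    for beta, count in segments:
        betas.extend([beta] * count)
    betas = betas[:max_iters]
    betas.extend([final_beta] * (max_iters - len(betas)))
    return betas

rs_ldpc = beta_schedule([(0.5, 5)], 0.25, 100)    # {0.5^5, 0.25^(L-5)} with L = 100
ar4ja = beta_schedule([(0.5, 26)], 0.25, 100)     # {0.5^26, 0.25^(L-26)}, L chosen for illustration
extra = 3.46 / 2.5 - 1.0                          # relative iteration overhead of RHS vs. NMS
```

The last line reproduces the arithmetic in the paragraph above: 3.46 average iterations against 2.5 for NMS is an overhead of about 38%.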



VI. Conclusion 



We introduced a binary message passing decoding algorithm for LDPC codes that simplifies
the wiring and the layout of fully-parallel circuit implementations, while being able to achieve
an error rate that is equal to or better than the well-known Sum-Product and Normalized Min-Sum
algorithms. To demonstrate the practicality of the RHS algorithm, we presented a low
complexity implementation, as well as simulation results that show that the bit error rate
performance remains good even at low decoding latencies. In addition, we introduced the
$\beta$-sequence method for reducing the average number of iterations with a negligible impact on
implementation complexity, and we described an algorithm for resolving some decoding failures
in the error floor region, named "VN Harmonization". This algorithm only requires additions
with a pre-determined constant, operates locally in each variable node, and can be used with any
BP algorithm that uses the LLR representation for variable node computations.

Our experimental results suggest that, at least for certain codes, the RHS algorithm provides
a significant gain over existing algorithms in data throughput normalized to circuit area. This
throughput gain can also be traded off for increased power efficiency.

Fig. 1. Functional diagram of the RHS VN computation. For simplicity, only the output associated
with edge i = 1 is shown. An iteration index t is implied for all variables, except the prior p.
(11) and (13b) refer to the respective equations.

Fig. 3. Circuit diagram of a fully-partitioned check node. Circles represent variable node circuits.

Fig. 5. Frame error rate in terms of the maximum number of iterations. Solid curves are for the
RS-LDPC code, at 4 dB, and dashed curves are for the AR4JA code, at 2 dB. The RHS algorithm is
simulated with floating-point probability tracking (Eq. (11)) and various message precisions, and
compared with floating-point NMS.

Fig. 6. Bit error rate performance on the RS-LDPC code. Results for quantized decoders are shown
with dashed curves. All RHS curves use k = 2 bits per message. (*) At 4.6 dB a 50-iteration Phase-II
is used after Phase-I. (**) At 4.6 dB a 25-iteration Phase-II is used. (***) At 4.8 dB a
30-iteration Phase-II is used. The Split-Row curve [4] uses 16 partitions and 4-wire links between
partitions to handle the Threshold mechanism, and 11 iterations.

Fig. 7. Bit error rate performance on the AR4JA code. RHS uses k = 4 bits per message.

ALGORITHMS

Algorithm 1 BP Template
procedure BP(y)
    ζ_i ← init(y_i), ∀ i
    Initialize θ_{j,i}, ∀ i, j
    for t ← 1 to L do
        for i ← 1 to n do
            for all j ∈ V_i do
                η_{i,j} ← VAR({ζ_i} ∪ {θ_{a,i} : a ∈ V_i} \ {θ_{j,i}})
        for j ← 1 to m do
            for all i ∈ C_j do
                θ_{j,i} ← CHK({η_{a,j} : a ∈ C_j} \ {η_{i,j}})
        Compute the decision vector x̂(t)
        Terminate if x̂(t) is a valid codeword
    Declare a decoding failure

Algorithm 2 VN Harmonization
if |M(t)| = 1 then
    for i ← 1 to d_v, i ≠ j do
        if Λ_j(t) > 0 then
            Λ_i(t) ← Λ_i(t) + d
        else
            Λ_i(t) ← Λ_i(t) − d